Chris Pollett
HW#5 --- last modified February 06 2019 04:16:20. Due date: May 13
Files to be submitted:

Purpose: To gain experience with index compression methods and a variety of relevance measures. To learn about map reduce algorithms.

Related Course Outcomes: The main course outcomes covered by this assignment are:
CLO4 -- Give an example of how a posting list might be compressed using difference lists and gamma codes or Rice codes.
CLO5 -- Demonstrate with small examples how incremental index updates can be done with log merging.
CLO7 -- Know at least one Map Reduce algorithm (for example to calculate page rank).

Specification: Do the following problems and submit them in Problems.pdf, which you should include in your Hw5.zip folder:
For the coding part of the homework I would like you to split the search_program.php that we have been developing this semester into two programs: index_program.php and query_program.php. The first would be run from the command line with a syntax like:

    php index_program.php path_to_folder_to_index index_filename

Here path_to_folder_to_index should be the path to a folder filled with plain text documents to index. These documents will be numbered 00.txt, 01.txt, ..., 10.txt, 11.txt, 12.txt, etc. You can assume the folder has at most 100 files. You can treat the file name, excluding the ".txt" file extension, as the doc id. On input such as the above, your program should write an index to index_filename.

For this index, the dictionary should only store stemmed versions of words, and should take a dictionary-as-string approach to its layout. Posting lists should be stored as gamma-code compressed delta lists of offsets into a document map. A document map entry should consist of a document id, a map entry length, a document length, and a sorted list of the distinct stemmed terms in the document with their frequencies. Alternatively (choose one approach or the other), a posting can be a gamma-code compressed delta list offset followed by a gamma-coded frequency, in which case the document map entries are just document ids followed by document lengths.

Your query program will be run from the command line like:

    php query_program.php index_filename query relevance_measure

Here index_filename is the name of an index file that might have been produced by your index_program.php, query is some query to run against this index file, and relevance_measure is either BM25 or your choice of one of DFR, LMJM, or LMD. You should have a readme.txt file which, besides listing your team members, says which of these three relevance measures you chose.
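As a rough illustration of the "gamma-code compressed delta list" idea for posting lists, here is a minimal sketch. It is written in Python for brevity, although the assignment itself requires PHP. It assumes the gamma convention of a unary length prefix (zeros) followed by the binary digits of the number, that offsets are strictly increasing and at least 1, and that bits are kept in a plain string with no byte packing or padding. The function names are hypothetical and not part of the assignment.

```python
def gamma_encode(n):
    """Gamma code of a positive integer, as a string of '0'/'1' characters."""
    if n < 1:
        raise ValueError("gamma codes are only defined for integers >= 1")
    binary = bin(n)[2:]                      # e.g. 9 -> '1001'
    # (len - 1) zeros announce the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes (no trailing padding assumed)."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def encode_postings(offsets):
    """Store each offset as the gap from the previous one, gamma-coded."""
    out = []
    previous = 0
    for offset in offsets:
        out.append(gamma_encode(offset - previous))
        previous = offset
    return "".join(out)


def decode_postings(bits):
    """Rebuild the original offsets by summing the decoded gaps."""
    offsets = []
    total = 0
    for gap in gamma_decode_all(bits):
        total += gap
        offsets.append(total)
    return offsets
```

For example, the offsets 3, 7, 8, 20 have gaps 3, 4, 1, 12, so `encode_postings([3, 7, 8, 20])` gives `"0110010010001100"` (16 bits), and `decode_postings` on that string recovers the original list. A real index file would still need to pack these bits into bytes and record where each posting list ends.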
On the above input, your program should use the index to compute a conjunctive query of the terms in query and score the resulting documents using the provided relevance measure. It should then output these in descending order of score in a format usable by trec_eval. You should include with your project a test subfolder, which should have plain text documents with names as described above. Using this test set, do some experiments to compare the measure you chose against BM25 using trec_eval. Write up your results in experiments.pdf.

Point Breakdown
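For the query program's conjunctive scoring and trec_eval output step described in the specification, one possible sketch follows. It is in Python for brevity (the assignment requires PHP). It assumes the index has already been decoded into in-memory dictionaries, uses a common form of BM25 with the usual default parameters k1 = 1.2 and b = 0.75, and takes the natural logarithm for the IDF factor. The data layout, function names, query id, and run id are all hypothetical. The output lines follow the usual trec_eval results layout: query id, the literal Q0, doc id, rank, score, run id.

```python
import math


def bm25_conjunctive(query_terms, postings, doc_lengths, k1=1.2, b=0.75):
    """Score documents that contain every query term; best score first.

    postings:    dict term -> dict doc_id -> term frequency (assumed layout)
    doc_lengths: dict doc_id -> document length in terms
    """
    num_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / num_docs
    # Conjunctive query: keep only documents present in every posting list.
    candidate_sets = [set(postings.get(t, {})) for t in query_terms]
    matches = set.intersection(*candidate_sets) if candidate_sets else set()
    results = []
    for doc in matches:
        score = 0.0
        for t in query_terms:
            tf = postings[t][doc]
            idf = math.log(num_docs / len(postings[t]))
            # Length normalisation: longer-than-average documents are damped.
            norm = k1 * ((1 - b) + b * doc_lengths[doc] / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        results.append((doc, score))
    # Descending score; doc id breaks ties so the order is deterministic.
    results.sort(key=lambda pair: (-pair[1], pair[0]))
    return results


def trec_lines(results, query_id="1", run_id="bm25_run"):
    """Format results as: query_id Q0 doc_id rank score run_id."""
    return [f"{query_id} Q0 {doc} {rank} {score:.4f} {run_id}"
            for rank, (doc, score) in enumerate(results, start=1)]


if __name__ == "__main__":
    postings = {"cat": {"00": 2, "01": 1}, "dog": {"00": 1, "02": 3}}
    doc_lengths = {"00": 3, "01": 1, "02": 3}
    for line in trec_lines(bm25_conjunctive(["cat", "dog"], postings, doc_lengths)):
        print(line)
```

In this toy data only document 00 contains both "cat" and "dog", so the conjunctive query returns a single result line for it. Swapping in DFR, LMJM, or LMD would only change the per-term scoring expression inside the inner loop.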